Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
arxiv.org·2d
RecUserSim: A Realistic and Diverse User Simulator for Evaluating Conversational Recommender Systems
arxiv.org·2d
The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
arxiv.org·2d
LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
arxiv.org·3d